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Abstract 



We solve the Hubbard model with the exact diagonalization method on a graphics processing unit (GPU). We benchmark our 
GPU program against a sequential CPU code by using the Lanczos algorithm to solve the ground state energy in two cases: a 
^ one-dimensional ring and a two-dimensional square lattice. In the one-dimensional case, we obtain speedups of over 100 and 60 in 
jingle and double precision arithmetic, respectively. In the two-dimensional case, the corresponding speedups are over 110 and 70. 
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J. Introduction 

The Hubbard model was introduced in the 1960s to describe 
interacting electrons in a solidldii. It has since been the 
subject of extensive study and is still a source of interesting 
physicsfl. It is perhaps the simplest model to display many of 
"the essential features of electron correlations, such as ferromag- 
netism and conductor-Mott-insulator transition. 

In the Hubbard model, the solid is described by a fixed lat- 
tice, where electrons can hop from one lattice site to another 
The electrons are always bound to an atom, such that their wave 
functions are vectors whose components squared are the prob- 
abilities of finding the electron at the corresponding lattice site. 
Interactions take place only between electrons that are residing 
on the same site. 

■ The Hamiltonian can be written as 



H 



Hhop + H, 



ml 



(1) 
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.where < ij > denotes a sum over neighboring lattice sites, c! 
and Ci cr are the creation and annihilation operators which re- 
spectively create and annihilate an electron at site / with spin 
tr, and riijj- - Cj u- counts the number of such electrons. U 
is the interaction energy and t is the hopping amplitude. The 
creation and annihilation operators obey the typical anticom- 
mutation rules for fermions. 



^1' ^a) = ^U^'TT and [cl, c)J = [a^, cjr] = 0, (3) 



which means that there are four possible occupations for a lat- 
tice site: either it is empty, has one up electron, one down elec- 
tron or one of each. 



An important property of the Hamiltonian is that the num- 
bers of both up and down electrons are separately conserved. 
This is convenient because it allows one to fix the number of up 
and down electrons and thus restrict to a subspace of the whole 
Hilbert space. 

Despite the model's simplicity, an analytic solution is only 
available in one dimension, and it was found by Lieb and Wu 
in I9681I5I]. In general, computational methods are required. 
While both terms in the Hamiltonian are easy to diagonalize 
separately, their sum is highly nontrivial. One method to nu- 
merically solve the Hubbard model is exact diagonalization. 
The idea is to simply calculate the matrix elements in a suit- 
able basis and then diagonalize the resulting matrix. The ob- 
vious downside of this approach is that the number of lattice 
sites and particles that can be considered is quite low due to the 
very rapid growth of the dimension of the Hamiltonian matrix 
as a function of the system size. However, the major advantage 
is that the results are exact up to numerical accuracy, which 
makes exact diagonalization well-suited to situations where a 
perturbative solution is not possible. It can also be used to test 
the reliability of other, approximative methods by comparing 
their results with the exact ones. 

2. Exact diagonalization 

2.1. The Hamiltonian 

To calculate the matrix elements of the Hubbard Hamiltonian 
in Equation (|2|, we choose a simple basis where the lattice sites 
are numbered from upward and the basis states correspond to 
all the ways of distributing the electrons in the lattice. For ex- 
ample, if we have A^, = 4 lattice sites, N-j = 2 spin up electrons 
and Ni = 3 spin down electrons, then 



c c' c c' c' \0) 



(4) 
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is one basis state. In state (|4|, the up electrons reside on sites 
and 2 and the down electrons on sites 0, 2 and 3. The empty 
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lattice into which the electrons are created is denoted by \0). 
To resolve any ambiguities arising from different orderings of 
the creation operators in Equation (|4), we define that in the ba- 
sis states all spin up operators are to the left of all spin down 
operators, and site indices are in ascending order 

The dimension of the Hamiltonian matrix is equal to the 
number of ways of distributing A^^ spin up electrons and Ni 
spin down electrons into Ns lattice sites, i.e. 



dim// 



(5) 



The size of the basis grows extremely fast. For example, in 
the half-filled case where N-[ - Ni - Ns/2, for 12 sites 
dim// = 853776, for 14 sites dim// ^ 11.8 x 10*' and for 
16 sites dim// ^ 166 x 10*'. In addition, the matrices are very 
sparse, because the number of available hops, and thus the num- 
ber of nonzero elements in a row, grows only linearly while the 
size of the matrix grows exponentially. 

To form the Hamiltonian, we need to label and order the basis 
states. A convenient way to do this is to represent each state 
with binary numbers such that occupied and unoccupied sites 
are denoted by 1 and 0, respectively. For example, the state in 
(|4]i becomes 



''0T''2T''0i''2i''3i 
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up down 



(6) 



Note that in our convention site indices run from right to left in 
the binary number representation. 

A simple way to order the basis states is to do it according 
to the size of the binary number that represents the state. Using 
this scheme, if we index the states by J, the conversion from 
the binary representation is given by 



J = i 



(7) 



where and ii are the positions of the up and down configu- 
rations in an ordered list, starting from 0, of all A^5-bit numbers 
with Ni and Ni bits set, respectively. To clarify, for example 
in (|6]l, the possible up configurations, in order, are 0011, 0101, 
0110, 1001, 1010 and 1100, so 0101 is the second configura- 
tion, and /| - 1. Similarly, 1101 is the third 4-bit number with 
3 bits set, so /| - 2. Thus, we get 



J 



ixl U2 



(8) 



which is confirmed by Table [T] 

Forming the Hamiltonian matrix is now straightforward. The 
interaction part //„» is diagonal, and it essentially just counts 
the number of doubly occupied lattice sites and increases the 
energy by U for each instance. The matrix elements of the hop- 
ping part, Hhop, are +t between basis states that can be reached 
from each other by a hop of a single electron, and vanish other- 
wise. 

For example, if we have a one-dimensional lattice with peri- 
odic boundaries, from Table [1] we see that 
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Table 1: A scheme for labelling the basis states for = 4, = 2, 
Ni = 3. States are ordered first according to the up spin configura- 
tion (first column) and then according to the down spin configuration 
(second column), in ascending order. 



because |6) can be reached from |2) when the up electron at site 
1 hops to site 2. From |3) to |6) it takes two hops so the matrix 
element vanishes. Because of the binary number representation 
of the basis states, in the computer these calculations can be 
conveniently done with integers and bitshift operations. 

The signs of the nonzero matrix elements are determined by 
the anticommutation relations of the creation and annihilation 
operators. An extra minus sign is picked up for each electron 
of the same spin that is hopped over. So in the one-dimensional 
case, if the hop is over the periodic boundary and the total num- 
ber of electrons of the same spin is even, the matrix element 
changes sign. For example. 



(61 //,,„/, 122) = f and (0| //,,„,, |3) = -f 



(10) 



\{2\HHop\6}\ = t and (3 1 //,,„„ |6) = 



(9) 



because there is an even number of up spins and an odd number 
of down spins (note the minus in the Hamiltonian). Note that 
the method is completely general and applies to any kind of 
lattice and any number of electrons. In a general lattice, the 
sign is determined by the number of electrons of the same spin 
residing in lattice sites whose labels are between the labels of 
the origin and the destination of the hop. 

2.2. The Lanczos algorithm 

It is apparent that fully diagonalizing the Hamiltonian using 
standard methods is only practical for very small systems. For- 
tunately, we are usually mostly interested in the ground state 
and the lowest excited states. A well known method for accu- 
rately approximating the lowest eigenvalues and eigenstates of 
sparse matrices is the Lanczos algorithm^ . 

The idea of the Lanczos algorithm is to project the Hamilto- 
nian onto an orthogonaUzed basis in a Krylov subspace, defined 
by 

%n(f, H) = span(/, ///, //V, . . . , H"'-7), (11) 

where / is a random starting vector and m is the dimension of 
the subspace. The result of the Lanczos procedure is a tridiag- 
onal matrix, i.e. one with nonzero elements only on the main 
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Algorithm 1 The Lanczos algorithm 161] 



Require: a random initial vector f\ of norm 1 



for j - \ io m do 

'ij ^ Hfj - bjfj^i 

Ij ^ - ''jfj 

if bj+i - then 

Stop 
end if 

//■+! ^ 
end for 



diagonal and the first sub- and superdiagonals. The dimension 
of the Krylov space does not have to be known beforehand. 
Instead, one can check for convergence after each iteration, for 
example by calculating the difference in the ground state energy 
from the previous iteration. The lowest (and highest) eigenval- 
ues accurately approximate the corresponding eigenvalues of H 
already for m «: dim H . 

One way to write the Lanczos algorithm is, in pseudocode, 
given by Listing 1 . The algorithm generates the so called Lanc- 
zos basis, {/i,/2, ...,/„,], in the Krylov space, and the projec- 
tion of H in this basis is given by the generated constants aj and 



bj as 



T = 



a\ b2 
b2 a2 

■. 















bm-\ flm-l b„ 

•■• b,„ a,„ / 

Since T is tridiagonal, symmetric and typically much smaller 
than H, its eigenvalues are easy to calculate with standard meth- 
ods. 

Since memory is a scarce resource, especially on the GPU, 
we would like to use as few vectors as possible. Superficially, it 
seems like three vectors are needed for the Lanczos algorithm 
because of the three-term recurrence relation on line 4 of Algo- 
rithm [T] 

qj^Hfj-bjfj^,. (12) 

However, we can manage with only two vectors by splitting this 
into 

/i-i ^ -^/i-i (13) 

fj^i^fj^i+Hfj m) 

and then renaming to qj. 

3. GPU computing 

3.1. Introduction 

Graphics processing units (GPU), originally developed to aid 
the central processing unit (CPU) in rendering graphics, have 



b,„ 



evolved into powerful computational engines, capable of gen- 
eral purpose calculations. In recent years, they have been in- 
creasingly used in a variety of scientific disciplines, including 
physics, to speed up computations that benefit from the archi- 
tecture of the GPU. 

The GPU is quite different than the CPU. Simply put, while 
the CPU performs few concurrent tasks quickly, the GPU exe- 
cutes a very large number of slower tasks simultaneously, i.e. in 
parallel. Although modern CPUs typically consist of multiple 
cores, allowing parallel computation to some extent, the scale 
of parallelization in the GPU is orders of magnitude larger, up 
to tens of thousands of simultaneous threads in modern cards. 

To benefit from GPUs, the problem has to be suitable for 
large scale parallelization. In addition, specifics of the GPU ar- 
chitecture need to be taken into account. For example, to get a 
performance gain, the program should have high arithmetic in- 
tensity, defined as the number of arithmetic operations divided 
by the number of memory operations. This is because accesses 
to the memory have a high latency and it is therefore desirable 
to hide this latency with calculations. Also, data transfer be- 
tween the CPU and the GPU through the PCI Express bus is 
slow, and should therefore be minimized. 



3.2. CUDA 



Compute Unified Device Architecture (CUDA) is a parallel 
computing architecture by the GPU manufacturer NVIDIA. It 
allows programmers to utilize GPUs with CUDA capability in 
general purpose computation through the use of an extension of 
the C programming language. 

CUDA programs consist of the main program that runs on the 
CPU {the host), and special functions called kernels that run on 
the GPU {the device). Since the GPU has a SIMD architec- 
ture, kernels are written from the viewpoint of a single thread. 
Threads are organized into groups that are called blocks. 

When a kernel is launched from the host code, the program- 
mer specifies the number of blocks and the number of threads 
per block in the kernel call. Each thread has its identification 
number within the block stored in an internal variable threa- 
dldx, which can be one-, two- or three-dimensional. Similarly, 
the block ID is stored in a variable called blockldx. With these 
ID numbers the programmer can assign the threads to different 
data. 

Threads can be synchronized within a block, but blocks have 
to be able to execute independently. Within a block, threads 
have a common fast memory space called shared memory that 
can be used as a user-managed cache. Optimally, one would 
like to read the relevant data from the slow global memory into 
the fast shared memory, perform the calculation there and write 
the result back to the global memory. In the latest generation of 
NVIDIA GPUs, called Fermi, there are also automatic LI and 
L2 caches for global memory accesses. For a more thorough 
overview of CUDA, we refer to Ref. 
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4. GPU implementation 

4.1. Overview 

In the following subsections, we will describe the details of 
our GPU implementation. The general structure of the program 
is that of a typical GPU application: first, we prepare the data, 
such as the Hamiltonian matrices, on the host. Then, we allo- 
cate memory on the device and transfer the data to the global 
memory of the GPU. Computations are then done on the GPU 
by launching the necessary kernels, and finally the results are 
transferred back to the host. 

Our application runs on a single GPU. A multi-GPU imple- 
mentation, while certainly feasible, would not offer significant 
benefits from a physical point of view. This is because the re- 
quired amount of memory grows so rapidly with the system 
size that even with a large number of GPUs, we would not 
have access to significantly larger systems compared to a single 
GPU. Additionally, since the communication between GPUs 
currently has to take place through the host system, it might 
be difficult to achieve good multi-GPU performance because of 
the large amount of communication needed in the Hamiltonian 
times vector operation. 

4.2. Storing the Hamiltonian 

The overwhelmingly most time-consuming part of the Lanc- 
zos algorithm for large systems is the operation on a vector with 
the Hamiltonian on line 4 of Algorithm [T] For small matrices, 
this could be implemented as simple matrix-vector multiplica- 
tion. Indeed, there have been previous GPU implementations 
of the Lanczos algorithm relating to graph bisection and image 
segmentation!^ as well as latent semantic analysisl9|l. In these 
works, the matrix sizes have been small enough to allow a direct 
computation of the matrix-vector product. 

However, as discussed in Subsection 12.11 our Hamiltonian 
matrix becomes very large already for systems with over 10 lat- 
tice sites. The newest Fermi generation of NVIDIA graphics 
cards have up to 6 GB of memory, so there is no hope of stor- 
ing the full Hamiltonian in the memory of the GPU. With the 
interaction part this is not a problem because it can be easily 
generated on the fly, but with the hopping part we would prefer 
to have it precomputed. 

To compress Hi,op, we first split it into a Kronecker sum of up 
and down hopping Hamiltonians, which operate only on elec- 
trons with the corresponding spin: 



Hi 



hop 



H-^®Hi- H-^®Il+ I-^® Hi, 



(14) 



where la- is the identity operator for electrons with spin cr and ® 
denotes the Kronecker product of matrices, which corresponds 
to the tensor product of linear maps, and is defined as follows: 
If A is a m-hy-n matrix with elements a/j, then the Kronecker 
product of A and another (arbitrary sized) matrix £ is a block 
matrix defined by 
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indices = (0, 1, 0, 1, 1, 2, 2, 2, *, 3, *, *) 

Figure 1: The ELL foiTnat produces two smaller matrices from the 
initial sparse matrix. In practice, these will be converted to vectors in 
column-major order. The stars denote padding and they are set to zero. 



Further, because Hi and Hi are still very sparse, we can 
store them efficiently by using a standard sparse matrix for- 
mat. We choose the ELL format ifioll . in which the sparse ma- 
trix is compressed horizontally into a dense matrix. The width 
of the resulting matrix is determined by the maximum num- 
ber of nonzero elements per row in the original matrix. An- 
other matrix of the same size contains the column indices of 
the elements. ELL is well-suited for matrices whose number of 
nonzero elements per row does not change very much because 
that means that little space is wasted to padding. This is the 
case with Hi and Hi. 

As shown in Fig. [T] the ELL formatted matrices are stored in 
the memory column-wise. This is because in the matrix-vector 
product each dot product of a row and the vector is done in par- 
allel. Thus, in the first iteration, the threads access the elements 
of the first column, in the second iteration the second column 
etc. When the matrix is stored column-wise, these memory ac- 
cesses are contiguous, resulting in better memory bandwidth. 

After the splitting of Hhop into Hi and Hi and storing them 
with the ELL format, the size of the Hamiltonian is no longer 
a problem. Instead, the memory consumption is almost solely 
determined by the number of stored state vectors. The com- 
plication is that operating on a vector with Hhop is no longer 
simple matrix-vector multiplication. 

4.3. The kernel 

The core of our GPU implementation of the exact diagonal- 
ization procedure is the kernel that computes the product of the 
Hamiltonian and a state vector It naturally consists of three dif- 
ferent parts: operating with the up part of H^op, the down part 
of Hhop and //,„,. To avoid unnecessary memory accesses, it is 
beneficial to do them all in the same kernel. 

To see what the spin up hopping Hamiltonian does to a state, 
it is useful to think of the vector as consisting of dim Hi sub- 
vectors of length dim Hi. The basis states within each subvector 
have the same up spin configuration. It is then straightforward 
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Algorithm 2 The kernel pseudocode for operating with the spin 
up hopping Hamihonian 

Require: blockID {the block index) 

Require: sv {the subvector index) 

Require: id {the thread index within the subvector) 

Require: blockID < dimUp * blocksPerSubvector 

Require: id < dimDn 



sum ' 







if threadldx.x < numcolsUp then 

Ax5[threadldx.x]<— AxUp[threadIdx.x*dimUp + sv] 
Aj 5 [threadldx.x] «— AjUp[threadIdx.x*dimUp + sv] 

end if 

syncthreads 

for ; = to numcolsUp do 

if Ax, + then 

sum <— sum + Axs[i] * x[Ajs[i] * dimDn + id] 

end if 
end for 



X - 



Xi 



-'^dim Hi 
■*^dimH^ + l 



j(0) 
„(1) 



jjCdimHt-l) ^ 



■'CdimWi dimH|-l j 

Figure 2: The state vectors can be thought to consist of subvectors of 
constant up spin configuration. 



to show that operating with the spin up hopping Hamiltonian 
is just like normal matrix- vector multiplication on the vector of 
subvectors: 



j(0) 

^(1) 



(dimHt-1) 



(15) 



Implementing this on the GPU is rather straightforward be- 
cause we are essentially just copying parts of the input vector 
to other parts in the output vector Therefore, we will launch 
dim Hi threads per row that calculate the dot product of that 
row with X. The kernel code can be seen in Algorithm|2] 

The kernel receives the two ELL formatted matrices as in- 
puts, and inside the kernel. Up and Dn in variable names refer 
to and H^, respectively. The variables dim contain the di- 
mensions of the Hamiltonians, and numcols have the number 
of columns of the ELL representations. The ELL formatted 



Algorithm 3 The kernel pseudocode for operating with the spin 
down hopping Hamiltonian 

Require: blockID {the block index) 

Require: sv {the subvector index) 

Require: id {the thread index within the subvector) 

Require: blockID < dimUp * blocksPerSubvector 

Require: id < dimDn 



for / = to numcolsDn do 
Aij <- AxDn[i * dimDn + id]; 
col <— AjDn[i * dimDn + id]; 
sum <— sum + Aij * x[sv * dimDn ■ 

end for 



col] 



Algorithm 4 The kernel pseudocode for operating with the in- 

teraction Hamiltonian 

Require: gID {the global thread index within the vector x) 
Require: U {the interaction strength) 
Require: statesUp {the spin up basis states) 
Require: statesDn {the spin down basis states) 



if U then 

doubles <— statesUp[gID/dimDn] & statesDn[gID 
mod dimDn] 

intE «- U * BitCount(doubles) 
sum <— sum + intE * x[sv * dimDn -i- id] 
end if 



matrices are in the arrays Ax (the data vector in Figure [I] and 
Aj (the indices vector in FigurelTJ- 

Due to hardware limitations, we might need one or more 
blocks per subvector in order to have one thread per element 
of X. In the beginning of the kernel, we calculate a couple of 
helpful variables, the subvector index of the current thread, sv = 
blockID / blocksPerSubvector (integer division) and the thread 
index within the subvector, id = (blockID mod blocksPer- 
Subvector) X threadsPerBlock + threadldx.x. 

After discarding any extraneous blocks or threads, we load 
the elements of the appropriate row of the spin up hopping 
Hamiltonian into shared memory arrays Ax, and Aj, on lines 
4 and 5 and synchronize afterwards. Finally, we loop over the 
Hamiltonian and accumulate the result in the register variable 
called sum. 

Operating with the spin down part of the hopping Hamilto- 
nian is straightforward. We just compute an ordinary matrix- 
vector product with each subvector: 



m®H{}x^ 



(16) 



The kernel code for this can be seen in Algorithm |3] 

Finally, we operate with the interaction Hamiltonian (Algo- 
rithm |4|i, which is generated on the fly from the vectors state- 
sUp and StatesDn that contain the integers corresponding to the 
binary representations of the up and down basis states, respec- 
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lively. First, we generate a binary number where set bits corre- 
spond to doubly occupied sites by doing a bitwise and operation 
between the up and down states on line 2. Then, we count the 
number of bits set and multiply by U and increment the vari- 
able sum by the product of the interaction matrix element and 
the corresponding element of x. 

After that, we are done. The final result of the Hamiltonian 
times X operation has now been accumulated by each thread 
into sum, and it can be written into an output vector in the global 
memory of the GPU. 

In the above, we have for simplicity assumed that each thread 
calculates one element of the result vector A simple, yet per- 
haps counter-intuitive optimization is to increase the workload 
of the threads. This is especially helpful in single precision. 
According to our experiments, best results can be achieved by 
calculating four elements per thread, i.e. four subvectors per 
block. While this increases register and shared memory usage, 
which leads to smaller occupancy, an impressive reduction of 
up to 50% in the execution time of the kernel can be obtained 
compared to calculating only one element per thread. 

5. Performance 

To measure the performance of our GPU implementation, 
we compared the execution time of the Lanczos algorithm on 
a Tesla M2070 GPU against a serial CPU program running on 
a Intel Xeon X5650. The algorithms in the two implementa- 
tions are identical, including the modification in Equations ( fTSl l. 
The validity of our GPU code has been checked by starting the 
Lanczos algorithm with the same starting vector in both codes, 
and verifying that the produced matrix elements aj and bj of the 
tridiagonal matrix in Equation (I2.2l i are identical up to floating 
point accuracy. In the CPU code, the hop ping Hamiltonian is 
stored using the CSR sparse matrix format 11 lOl instead of ELL, 
for better performance. 

We measured the time of a single iteration of the loop in 
Algorithm [U in two cases: a one-dimensional ring and a two- 
dimensional square lattice with periodic boundary conditions. 
The ID system was studied as a function of the system size 
while keeping the lattice half-filled, i.e. N-\ - N[ - Ns/2. In 
the 2D system, the lattice size was fixed to 4 x 4 and the par- 
ticle number was varied, such that there were equal numbers 
of up and down spin electrons. We need memory for two state 
vectors in the Lanczos algorithm, which limits us to systems of 
up to 16 lattice sites. We benchmarked the GPU program with 
both single and double precision arithmetic. The CPU was run- 
ning in double precision, but the results for single precision are 
essentially identical. 

The execution times for the ID system can be seen in Figure 
[3] The CPU shows an expected exponential growth of the exe- 
cution time as a function of the system size. The GPU, on the 
other hand, runs in more or less constant time until the system 
size reaches 10 lattice sites, and for larger systems shows the 
same exponential scaling as the CPU. 

To understand the behavior, we have profiled our application 
with the CUDA Visual Profiler For system sizes below 10, we 
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Figure 3: The execution times of a single iteration of the Lanczos 
algorithm for the ID system as a function of the system size. 
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Figure 4: The speedup factor of the GPU implementation running in 
single and double precision as a function of the system size for the ID 
system. 



find that the achieved occupancy of the multiprocessors is very 
low: 16% for 8 sites and less for the smaller systems. Also, the 
profiler reports very low values of below 0.5 for the instructions 
per clock cycle (IPC) statistic. These suggest that the explana- 
tion for the results of Figure|3]is that the smaller systems cannot 
saturate the GPU, and its resources are not fully utilized. Once 
the GPU is fully occupied, the degree of parallelism reaches 
its maximum, and any additional workload has to be serialized. 
This is reflected in IPC values of over 1 .5 for systems larger that 
10 sites, which is reasonably close to the maximum IPC value 
of 2. Also, the occupancy is over 60%, which is enough to hide 
the memory latency with computation. 

The speedup factor, calculated as the ratio of the CPU and 
GPU execution times, can be seen in Figure |4] The GPU over- 
takes the CPU in speed when the system size is 8 lattice sites. 
After a rapid growth, the speedup stabilizes to around 100 for 
single precision and 60 for double precision after 12 sites. In 
both cases, the maximum speedup is achieved with a system 
size of 14, where the speedups are 105 and 64 for single and 
double precision, respectively. 

In the square lattice, the situation is qualitatively similar to 
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Figure 5: The execution times of a single iteration of the Lanczos 
algorithm for the 2D system as a function of the number of electrons. 
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Figure 6: The speedup factor of the GPU implementation mnning in 
single and double precision as a function of the number of electrons in 
the 4x4 square lattice. 

the ID case. As we increase the number of electrons, the 
CPU execution time increases rapidly with the dimension of 
the Hamiltonian (Figure|5]l. The GPU, on the other hand, scales 
much better for small numbers of electrons, and starts to follow 
the CPU scaling as the basis grows large enough. The GPU is 
faster when there are more than two electrons. The maximum 
speedups are achieved with 10 and 8 electrons, and they are 118 
and 71 for single and double precision, respectively (Figure|6]l. 

6. Conclusions 

We have performed the exact diagonalization of the Hub- 
bard model by implementing the Lanczos algorithm on a GPU, 
which was programmed with the CUDA programming model. 
The core of the program is the kernel that computes the result 
of the Hamiltonian matrix operating on a state vector, which is 
the most computationally demanding operation in many calcu- 
lations, including the Lanczos algorithm and, for example, time 
propagation. 

The program was benchmarked against a single-core CPU 
program in two lattices, a one-dimensional ring and a two- 



dimensional square lattice. The GPU was found to be faster 
when the basis size was large enough to allow for fully utiliz- 
ing the resources of the GPU. In the ring, speedups of over 100 
and 60 were found in single and double precision runs, respec- 
tively. In the square lattice, the corresponding speedups were 
over 110 and 70. The main conclusion from this work is that 
GPUs are well-suited for exact diagonalization calculations and 
significant speedups can be obtained very cost-effectively. 
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